Variation-Bounded Loss for Noise-Tolerant Learning

Wang, Jialiang, Zhou, Xiong, Liu, Xianming, Hu, Gangfeng, Zhai, Deming, Jiang, Junjun, Li, Haoliang

arXiv.org Artificial Intelligence

Mitigating the negative impact of noisy labels has been a perennial issue in supervised learning. Robust loss functions have emerged as a prevalent solution to this problem. In this work, we introduce the Variation Ratio as a novel property related to the robustness of loss functions, and propose a new family of robust loss functions, termed Variation-Bounded Loss (VBL), which is characterized by a bounded variation ratio. We provide theoretical analyses of the variation ratio, proving that a smaller variation ratio leads to better robustness. Furthermore, we reveal that the variation ratio provides a feasible method to relax the symmetric condition and offers a more concise path to achieve the asymmetric condition. Based on the variation ratio, we reformulate several commonly used loss functions into a variation-bounded form for practical applications.
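The abstract does not spell out the paper's actual VBL formulation, but the core idea of a bounded-variation loss can be illustrated with a schematic sketch: shifting the argument of the log in cross entropy by a constant keeps the loss finite (and hence its variation over [0, 1] bounded) even when the target-class probability is zero. The function name and the constant `a` below are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def bounded_cross_entropy(probs, targets, a=0.1):
    """Illustrative bounded variant of cross entropy (NOT the paper's VBL).

    The shift `a` keeps the loss finite at p = 0, so the loss ranges over
    [0, log((1 + a) / a)] instead of [0, inf) as for plain cross entropy.
    """
    # probability assigned to the correct class for each sample
    p = probs[np.arange(len(targets)), targets]
    return float(-np.log((p + a) / (1.0 + a)).mean())
```

With `a = 0.1`, a completely wrong prediction (target probability 0) incurs a finite loss of log(11) ≈ 2.4 rather than an unbounded one, which is the kind of boundedness that robust-loss analyses exploit.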


ForTIFAI: Fending Off Recursive Training Induced Failure for AI Model Collapse

Shabgahi, Soheil Zibakhsh, Aghazadeh, Pedram, Mirhoseini, Azalia, Koushanfar, Farinaz

arXiv.org Artificial Intelligence

The increasing reliance on generative AI models is rapidly increasing the volume of synthetic data, with some projections suggesting that most available new data for training could be machine-generated by 2030 (Gartner, Inc., 2022). This shift toward mainly synthetic content presents a critical challenge: repeated training on synthetic data leads to a phenomenon known as model collapse, where model performance degrades over generations of training, eventually rendering the models ineffective. While the causes of model collapse are increasingly understood, effective mitigation strategies remain scarce. We address this challenge by leveraging a key insight: auto-regressive models tend to generate text sequences to which they assign high confidence (i.e., high log-likelihood). Based on this observation, we introduce the Truncated-Cross-Entropy (TCE) loss function. Our experiments demonstrate that models trained with TCE not only learn effectively but also exhibit significantly increased resilience, tolerating over 2.3× more synthetic data before the onset of collapse. In addition, we provide an open-source benchmark for collapse dynamics in mixed-data settings. Our results demonstrate that confidence-aware training objectives can substantially delay collapse onset, offering a practical and generalizable tool for model robustness under synthetic-data exposure. Generative models have become the foundation for modern AI applications in several modalities, including text, image, code, and audio. Large Language Models (LLMs) such as ChatGPT (OpenAI et al., 2024), LLaMA (Grattafiori et al., 2024), and Gemma (Team et al., 2025), as well as image generators such as DALL-E (Ramesh et al., 2021) and Imagen (Saharia et al., 2022), all rely on large datasets scraped from the Web. As these models are continuously updated to reflect recent knowledge and linguistic patterns, the need for ever larger and frequently refreshed training corpora has grown substantially. However, this demand is colliding with a shift in the data landscape: synthetic content is increasingly populating the Internet, contaminating the very datasets used for model training. This shift raises fundamental concerns.
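One plausible reading of a "truncated" cross entropy motivated by the abstract's insight is to drop tokens the model already predicts with very high confidence, so the loss is not dominated by the easy, high-likelihood sequences that synthetic data tends to contain. The sketch below is an assumption about how such a loss might look, not the paper's actual TCE definition; the threshold value is also illustrative.

```python
import numpy as np

def truncated_cross_entropy(logits, targets, conf_threshold=0.9):
    """Hypothetical confidence-truncated cross entropy (not the paper's TCE).

    Tokens whose target-class probability already exceeds `conf_threshold`
    are masked out of the loss; the remaining tokens contribute standard
    negative log-likelihood.
    """
    # numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # probability the model assigns to each target token
    p_target = probs[np.arange(len(targets)), targets]
    keep = p_target < conf_threshold  # truncate over-confident tokens
    if not keep.any():
        return 0.0  # every token was truncated
    return float(-np.log(p_target[keep]).mean())
```

Down-weighting over-confident tokens in this way reduces the gradient signal from text the model would have generated itself, which is the mechanism the abstract suggests delays collapse under synthetic-data exposure.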




Response to Reviewer

Neural Information Processing Systems

We sincerely thank Reviewer 1 for referring us to four relevant papers [1-4]. Paper [1] provides a very interesting relationship between the Fisher divergence and Stein's operator, as in 'A kernelized Stein discrepancy for goodness-of-fit tests.' Paper [4] establishes a more general result: if the model class is well-specified, then convergence to the data-generating distribution is guaranteed. Fano's inequality also gives lower bounds on model selection/message decoding error. We appreciate Reviewer 2's comments and recommendations. We will do another round of proofreading and remove typos.